feat(moe): add bf16 DeepEP-normal MoE path via DeepGEMM grouped GEMM by Tanmo-ai · Pull Request #1111 · alibaba/rtp-llm

Tanmo-ai · 2026-06-17T01:59:33Z

Summary

Add an opt-in expert path for unquantized (bf16) MoE under DeepEP normal mode
that uses DeepGEMM grouped GEMM instead of the Triton fused_moe_kernel.

Changes

DeepGemmBf16HybridExecutor: runtime-dispatches between a masked 3D layout
(small token count / decode) and a contiguous flat layout (large token count /
prefill) for better memory utilization.
ep_scatter_bf16 / ep_scatter_v2_bf16: bf16 variants of the existing fp8
scatter kernels (flat → contiguous, flat → 3D masked).
CudaNoQuantDpNormalDeepGemmStrategy: opt-in only, selected via
--moe_strategy no_quant_dp_normal_deepgemm, gated on bf16 + has_deep_gemm +
SM≥9 + no CUDA graph. It is not part of "auto" selection, so the default
MoE path on existing deployments is unchanged.

deepgemm_wrapper.py changes (backward-compatible)

These changes enable the bf16 grouped-GEMM path and do not affect existing
fp8/bf16 callers:

bf16 fallback symbol names corrected to the real deep_gemm symbols
(gemm_bf16_bf16_bf16_nt*). resolve_symbol() tries the standard name first
and only falls back, so existing resolution is unchanged — this only makes the
previously dormant bf16 path resolvable. The stale compiled_dims argument is
dropped from the contiguous/masked bf16 calls to match the actual deep_gemm
signature.
Symbol resolution deferred to first use (_ensure_initialized) instead of
at import time. Functionally identical, only lazy: the same symbols are
resolved, just on the first actual GEMM call.
has_deep_gemm() re-checks until the first successful import (then caches
True) instead of caching the first result. For normal processes where
deep_gemm is importable at import time it returns True on the first call
exactly as before; this only adds resilience when the package becomes
importable slightly later, and does not change existing behavior.

Testing

New unit test test_ep_scatter_bf16.py covers bf16 scatter kernel correctness.
Existing fp8/bf16 MoE paths and default auto-selection are unchanged.

CLAassistant · 2026-06-17T01:59:40Z

All committers have signed the CLA.

LLLLKKKK · 2026-06-17T02:09:14Z

AI Code Review - PR #1111

Status: BLOCKING

Summary: P0/0 · P1/1 · P2/1 · P3/0

Blocking Issues

P1

contiguous scatter 测试分配容量不足会越界写 @ rtp_llm/models_py/triton_kernels/moe/test/test_ep_scatter_bf16.py:315
- 建议：按 recv_topk 统计每个 expert 的真实 token 数并 align 后再分配，或降低 token_num/topk，确保 sum(aligned_counts) 覆盖所有有效写入。

Non-blocking Suggestions

P2

新增 BF16 DeepGEMM executor 缺少端到端正确性覆盖 @ rtp_llm/models_py/modules/factory/fused_moe/impl/cuda/executors/deepgemm_bf16_hybrid_executor.py:158
- 建议：补充 executor 级 BF16 no-quant 测试，覆盖 token_num 小于和大于阈值两条路径、strategy 显式选择，并校验 router weight 后的输出。

Checklist Violations (5 fail / 104 total)

General Principles Checklist

[6.1] Tests — 新逻辑有聚焦单测 + 相关集成/smoke 测试 → issue 新增 BF16 DeepGEMM executor 缺少端到端正确性覆盖
新增 executor/strategy 缺少端到端正确性测试；现有新增测试只覆盖 scatter kernel，未覆盖完整 GEMM/activation/gather 输出。
[6.1] Tests — 边界 case 覆盖（空、单元素、最大值） → issue contiguous scatter 测试分配容量不足会越界写
test_large_random 的 per-expert 容量固定为 128，合计 1024 slot 小于 2048 个有效 assignment，边界压测本身可能越界。
[6.1] Tests — 分布式/跨平台变更有对应覆盖 → issue 新增 BF16 DeepGEMM executor 缺少端到端正确性覆盖
新路径用于 DeepEP normal 分布式 MoE，但未看到 executor 级 EP/strategy 覆盖。

RTP-LLM Checklist

[B] 正确性与逻辑 — CUDA kernel batch 索引维度与 host buffer shape 匹配 → issue contiguous scatter 测试分配容量不足会越界写
新增 contiguous scatter 压测传入的 host buffer 容量与 kernel 实际写入规模不匹配，可能越界写。
[H] 测试与 CI — 测试覆盖充分：大重构等价覆盖，新功能端到端测试 → issue 新增 BF16 DeepGEMM executor 缺少端到端正确性覆盖
新增 BF16 DeepGEMM executor/strategy 是新功能，但缺少端到端输出对比测试；scatter large_random 还存在容量错误。

Strengths

新策略保持显式 opt-in，没有进入 auto 选择，降低默认 MoE 路径回归风险。
DeepGEMM wrapper 改为 lazy init，避免可选依赖在模块 import 阶段直接失败。

LLLLKKKK · 2026-06-17T04:03:06Z

AI Code Review - PR #1111

Status: LGTM

Summary: P0/0 · P1/0 · P2/3 · P3/0

lgtm ready to ci

Non-blocking Suggestions

P2

CUDA 计数张量可能落到错误设备 @ rtp_llm/models_py/modules/factory/fused_moe/impl/cuda/executors/deepgemm_bf16_hybrid_executor.py:308
- 建议：改为 .to(device=device, non_blocking=True)，并让 expert_start_loc 跟随同一 device。
BF16 grouped GEMM 的 compiled_dims 参数被静默忽略 @ rtp_llm/models_py/kernels/cuda/deepgemm_wrapper.py:595
- 建议：若底层支持则传递 compiled_dims；否则在 compiled_dims != "nk" 时显式抛错或移除该公开参数。
新增 DeepEP Normal 路径缺少多 rank 覆盖 @ rtp_llm/models_py/modules/factory/fused_moe/impl/cuda/executors/test/deepgemm_bf16_hybrid_executor_test.py:72
- 建议：补充至少一个 ep_size>1 的集成测试，经过 DeepepNormalRouter prepare/finalize，并校验跨 rank 输出与 torch reference 一致。

Checklist Violations (7 fail / 104 total)

General Principles Checklist

[6.1] Architecture — 状态不变量：创建/更新/失败/重试/回滚路径有效 → issue CUDA 计数张量可能落到错误设备
num_recv_tokens_per_expert_gpu 在 executor 中用 .cuda() 依赖当前 device，可能破坏与 hidden_states.device 同设备的不变量。
[6.1] Architecture — 错误语义：fail-fast/retry/fallback/silent 行为显式 → issue BF16 grouped GEMM 的 compiled_dims 参数被静默忽略
m_grouped_bf16_gemm_nt_contiguous/masked 保留 compiled_dims 参数，但传入非默认值时被静默忽略。
[6.1] Architecture — 兼容性：公开 API/持久数据/配置/环境迁移安全 → issue BF16 grouped GEMM 的 compiled_dims 参数被静默忽略
wrapper 公开参数 compiled_dims 未传递到底层实现，调用方会得到与参数含义不一致的行为。
[6.1] Tests — 新逻辑有聚焦单测 + 相关集成/smoke 测试 → issue 新增 DeepEP Normal 路径缺少多 rank 覆盖
已有 kernel/executor 单卡测试，但缺少真实 DeepepNormalRouter 多 rank 集成覆盖。
[6.1] Tests — 分布式/跨平台变更有对应覆盖 → issue 新增 DeepEP Normal 路径缺少多 rank 覆盖
DeepEP Normal 是分布式路径，新增测试固定 ep_size=1，未覆盖跨 rank dispatch/combine。

RTP-LLM Checklist

[E] 分布式 — 跨 rank 数据一致性 → issue 新增 DeepEP Normal 路径缺少多 rank 覆盖
测试未覆盖 EP>1 的跨 rank dispatch/combine，无法验证 global/local expert id 与 per-rank token count 一致性。
[H] 测试与 CI — 测试覆盖充分：大重构等价覆盖，新功能端到端测试 → issue 新增 DeepEP Normal 路径缺少多 rank 覆盖
新增端到端测试只覆盖 ep_size=1 且绕过 router，缺少该 DeepEP Normal strategy 的真实多 rank 覆盖。

Strengths

scatter kernel 测试通过 output_index 做顺序无关校验，避免把 tl.atomic_add 的非确定顺序误判为错误。
DeepGEMM 策略为显式 opt-in，避免改变现有 auto MoE 选择。
新增 BF16 DeepGEMM executor 分别覆盖 masked 与 contiguous 两条计算路径。

LLLLKKKK · 2026-06-17T10:31:04Z

AI Code Review - PR #1111

Status: LGTM

Summary: P0/0 · P1/0 · P2/0 · P3/0

lgtm ready to ci

Checklist ✅ (104 items passed)

Strengths

DeepGEMM BF16 executor 覆盖 masked/contiguous 两条路径，并补充 scatter/gather 与 executor 级端到端测试。

LLLLKKKK · 2026-06-17T12:43:22Z

AI Code Review - PR #1111

Status: LGTM

Summary: P0/0 · P1/0 · P2/2 · P3/0

lgtm ready to ci

Non-blocking Suggestions

P2

BF16 DeepGEMM 符号未在策略选择时校验 @ rtp_llm/models_py/modules/factory/fused_moe/impl/cuda/strategy/no_quant.py:121
- 建议：在 check_conditions 中校验 BF16 grouped GEMM 符号可解析，或新增 has_deep_gemm_bf16_grouped() 能力检查后再允许选中策略。
新增策略缺少工厂选路覆盖 @ rtp_llm/models_py/modules/factory/fused_moe/impl/cuda/strategy/no_quant.py:120
- 建议：补充 strategy/factory 单测，设置 moe_strategy="no_quant_dp_normal_deepgemm"，验证选中 CudaNoQuantDpNormalDeepGemmStrategy 及对应 router/executor。

Checklist Violations (2 fail / 56 total)

General Principles Checklist

[6.1] Architecture — 错误语义：fail-fast/retry/fallback/silent 行为显式 → issue BF16 DeepGEMM 符号未在策略选择时校验
策略选择阶段只检查 deep_gemm 包存在，未检查 BF16 grouped GEMM 符号，错误会延迟到首个执行请求。
[6.1] Tests — 新逻辑有聚焦单测 + 相关集成/smoke 测试 → issue 新增策略缺少工厂选路覆盖
已有 executor/kernel 聚焦测试，但新增 CLI strategy 没有 registry/factory 选路测试。

Strengths

新增 BF16 DeepGEMM 路径保持显式 opt-in，未进入 auto 默认选择，降低了默认 MoE 路径回归风险。
新增 executor 与 scatter kernel 测试覆盖 masked/contiguous 路径，并覆盖 EP rank0/rank1 的 global-to-local expert id 映射。

LLLLKKKK · 2026-06-17T14:46:11Z

AI Code Review - PR #1111

Status: LGTM

Summary: P0/0 · P1/0 · P2/3 · P3/0

lgtm ready to ci

Non-blocking Suggestions

P2

BF16 contiguous padding 行被当作有效 expert 参与 GEMM @ rtp_llm/models_py/triton_kernels/moe/ep_kernels.py:262
- 建议：将 padding 行标记为 m_indices=-1 或显式填零并固定该契约，补充 padding 行测试。
未校验 DeepGEMM BF16 grouped 符号可用性 @ rtp_llm/models_py/modules/factory/fused_moe/impl/cuda/strategy/no_quant.py:121
- 建议：在策略或 wrapper 增加 BF16 grouped symbol 探测，缺失时拒绝该策略或给出明确 fallback。
新增 DeepGEMM 策略缺少选择路径测试 @ rtp_llm/models_py/modules/factory/fused_moe/impl/cuda/strategy/no_quant.py:120
- 建议：在 test_cuda_strategies.py 增加该策略的正向选择和 fp16/auto/cuda_graph 等负向用例。

Checklist Violations (5 fail / 104 total)

General Principles Checklist

[6.1] Architecture — 错误语义：fail-fast/retry/fallback/silent 行为显式 → issue 未校验 DeepGEMM BF16 grouped 符号可用性
策略只检查 deep_gemm 包存在，未检查 BF16 grouped 符号是否可解析；缺失符号会推迟到 executor 运行期失败。
[6.1] Tests — 新逻辑有聚焦单测 + 相关集成/smoke 测试 → issue 新增 DeepGEMM 策略缺少选择路径测试
executor/kernel 有覆盖，但新增 moe_strategy 的 registry/can_handle 正负向选择路径未覆盖。
[6.1] Tests — 边界 case 覆盖（空、单元素、最大值） → issue BF16 contiguous padding 行被当作有效 expert 参与 GEMM
contiguous scatter 测试验证 occupied rows 和 roundtrip，但未断言 padding 行的 m_indices/填零契约。

RTP-LLM Checklist

[A] 兼容性与配置 — 可选依赖 lazy import，pybind 新字段有 C++ 默认值 → issue 未校验 DeepGEMM BF16 grouped 符号可用性
deep_gemm 依赖已 lazy，但策略层只确认包存在，未验证新增 BF16 grouped 符号可用。
[H] 测试与 CI — 测试覆盖充分：大重构等价覆盖，新功能端到端测试 → issue 新增 DeepGEMM 策略缺少选择路径测试
新增 no-quant DeepGEMM 策略是配置入口新功能，缺少策略选择层面的正负用例。

Strengths

新增 BF16 DeepGEMM executor 与 Triton scatter 测试覆盖 masked/contiguous、EP rank0/rank1 和 scatter-gather roundtrip，核心数值路径有针对性验证。

LLLLKKKK · 2026-06-18T03:49:47Z

AI Code Review - PR #1111

Status: BLOCKING

Summary: P0/0 · P1/1 · P2/1 · P3/0

Blocking Issues

P1

新策略探测会让非 opt-in 配置触发 DeepGEMM 符号解析异常 @ rtp_llm/models_py/modules/factory/fused_moe/impl/cuda/strategy/no_quant.py:126
- 建议：在检查 moe_strategy 后显式短路，或让 has_deep_gemm_bf16_grouped 捕获符号解析失败并返回 False。

Non-blocking Suggestions

P2

测试跳过条件未覆盖 BF16 grouped DeepGEMM 符号 @ rtp_llm/models_py/modules/factory/fused_moe/impl/cuda/executors/test/deepgemm_bf16_hybrid_executor_test.py:71
- 建议：改用 has_deep_gemm_bf16_grouped() 作为 skip 条件，并确保该 helper 对缺符号返回 False 而不是抛异常。

Checklist Violations (4 fail / 60 total)

General Principles Checklist

[6.1] Architecture — 错误语义：fail-fast/retry/fallback/silent 行为显式 → issue 新策略探测会让非 opt-in 配置触发 DeepGEMM 符号解析异常
可选 deep_gemm bf16 grouped 符号缺失应让新策略不可用，但当前探测会抛 RuntimeError 并中断策略枚举。
[6.1] Architecture — 兼容性：公开 API/持久数据/配置/环境迁移安全 → issue 新策略探测会让非 opt-in 配置触发 DeepGEMM 符号解析异常
旧 deep_gemm 包缺新 bf16 grouped 符号时，新注册策略的探测可能影响现有 auto 配置初始化。
[6.1] Tests — 新逻辑有聚焦单测 + 相关集成/smoke 测试 → issue 测试跳过条件未覆盖 BF16 grouped DeepGEMM 符号
新增 e2e 测试只按 has_deep_gemm() 跳过，未覆盖策略实际依赖的 BF16 grouped symbols 可用性。

RTP-LLM Checklist

[H] 测试与 CI — 测试覆盖充分：大重构等价覆盖，新功能端到端测试 → issue 测试跳过条件未覆盖 BF16 grouped DeepGEMM 符号
新增 e2e 测试未按 has_deep_gemm_bf16_grouped() 跳过，覆盖条件与策略门控不一致。

Strengths

新增 bf16 DeepGEMM 执行路径设计为显式 opt-in，未在策略意图上纳入 auto 默认选择。
测试覆盖 masked/contiguous 两条执行路径，并包含 scatter/gather 与 EP rank 的 expert id 映射场景。

LLLLKKKK · 2026-06-18T04:43:50Z

AI Code Review - PR #1111

Status: BLOCKING

Summary: P0/0 · P1/1 · P2/1 · P3/0

Blocking Issues

P1

DeepGEMM 初始化把 BF16 符号变成 FP8 路径的硬依赖 @ rtp_llm/models_py/kernels/cuda/deepgemm_wrapper.py:171
- 建议：按调用路径拆分初始化：FP8 wrapper 只解析 FP8 符号，BF16 能力探测只解析 BF16 grouped 符号，缺失时仅让新策略不可选。

Non-blocking Suggestions

P2

BF16 grouped GEMM 的 compiled_dims 参数被静默忽略 @ rtp_llm/models_py/kernels/cuda/deepgemm_wrapper.py:621
- 建议：要么继续把 compiled_dims 传给支持该参数的 deep_gemm 符号，要么移除/显式拒绝该参数，避免静默忽略。

Checklist Violations (3 fail / 104 total)

General Principles Checklist

[6.1] Architecture — 兼容性：公开 API/持久数据/配置/环境迁移安全 → issue DeepGEMM 初始化把 BF16 符号变成 FP8 路径的硬依赖
_ensure_initialized() 统一解析 BF16 grouped 符号会阻断旧 deep_gemm 环境的既有 FP8 wrapper；另有 BF16 grouped compiled_dims 参数静默忽略。

RTP-LLM Checklist

[A] 兼容性与配置 — 可选依赖 lazy import，pybind 新字段有 C++ 默认值 → issue DeepGEMM 初始化把 BF16 符号变成 FP8 路径的硬依赖
deep_gemm 可选依赖虽改为 lazy 初始化，但一次解析所有符号，BF16 grouped 缺失会扩散到既有 FP8 wrapper。
[B] 正确性与逻辑 — 接口返回类型变更有兼容处理 → checklist-only
单个 draft 认为 bf16_gemm_nt legacy fallback 签名不兼容；未达到 issue 保留阈值，作为兼容风险提示保留。

Strengths

新增 BF16 hybrid executor 覆盖 masked/contiguous 两条路径，并补充 scatter/gather、端到端数值与 EP rank 映射测试。

LLLLKKKK · 2026-06-18T05:51:57Z

AI Code Review - PR #1111

Status: BLOCKING

Summary: P0/0 · P1/1 · P2/0 · P3/0

Blocking Issues

P1

空 token 会进入 masked 路径并以 0 grid 启动 Triton @ rtp_llm/models_py/modules/factory/fused_moe/impl/cuda/executors/deepgemm_bf16_hybrid_executor.py:159
- 建议：在分流前处理 token_num==0 或 sum(expert_num_tokens)==0，直接返回同 shape 的空/零 fused_expert_output，并补空 batch/空 rank 测试。

Checklist Violations (5 fail / 104 total)

General Principles Checklist

[6.1] Architecture — 状态不变量：创建/更新/失败/重试/回滚路径有效 → issue 空 token 会进入 masked 路径并以 0 grid 启动 Triton
空 token/空 rank 状态未在分流前处理，会进入 masked 路径并触发 0-grid kernel。
[6.1] Tests — 分布式/跨平台变更有对应覆盖 → issue 空 token 会进入 masked 路径并以 0 grid 启动 Triton
DeepEP 小 batch/路由倾斜可能让某个 rank 的 token_num 为 0，当前测试没有覆盖该边界且实现会继续进入 masked path。
[6.1] Tests — 边界 case 覆盖（空、单元素、最大值） → issue 空 token 会进入 masked 路径并以 0 grid 启动 Triton
DeepEP 小 batch/路由倾斜可能让某个 rank 的 token_num 为 0，当前测试没有覆盖该边界且实现会继续进入 masked path。

RTP-LLM Checklist

[B] 正确性与逻辑 — 边界 case（空输入、单元素、最大值） → issue 空 token 会进入 masked 路径并以 0 grid 启动 Triton
token_num == 0 时 execute() 仍选择 masked path，后续 0 alignment 的 scatter/DeepGEMM 没有保护。
[H] 测试与 CI — 测试覆盖充分：大重构等价覆盖，新功能端到端测试 → issue 空 token 会进入 masked 路径并以 0 grid 启动 Triton
新增 executor/scatter 测试覆盖 masked/contiguous 和 ep2，但没有覆盖本 rank 收到 0 token 的 DeepEP 边界。

Strengths

新增策略为显式 opt-in，避免改变 auto 默认 MoE 路径。
新增测试覆盖 bf16 scatter kernel、DeepGEMM executor masked/contiguous 以及 EP rank0/rank1 主要路径。

LLLLKKKK · 2026-06-18T06:08:16Z

AI Code Review - PR #1111

Status: LGTM

Summary: P0/0 · P1/0 · P2/0 · P3/0

lgtm ready to ci

Checklist ✅ (56 items passed)

Strengths

新增 bf16 DeepGEMM 路径保持显式 opt-in，未进入 auto 默认策略，降低默认路径回归风险。
新增 bf16 scatter/gather 与 executor 测试覆盖 masked、contiguous、EP rank 映射和空 token/空 rank 路径，验证主要数据流。

wht21 · 2026-06-18T06:11:21Z

internal source has been updated, please review the changes!

Replace runtime signature introspection with explicit validation: bf16 grouped GEMM only supports compiled_dims='nk', reject others with ValueError. Matches PR alibaba#1111 approach. Remove _has_param helper and all inspect.signature usage — eliminates unintrospectable callable, **kwargs, and positional-vs-keyword edge cases.

LLLLKKKK · 2026-06-28T02:05:30Z

AI Code Review - PR #1111

Status: LGTM

Summary: P0/0 · P1/0 · P2/3 · P3/1

lgtm ready to ci

Non-blocking Suggestions

P2

activation 参数被接受但未使用，硬编码 SiLU 缺少校验 @ rtp_llm/models_py/modules/factory/fused_moe/impl/cuda/executors/deepgemm_bf16_hybrid_executor.py:137
- 建议：在 check_conditions 或 execute 入口添加 assert activation in ('silu', 'SiGLU', 'swiglu')，确保非 SiLU 模型不会静默使用错误的激活函数。
executor check_conditions 使用弱检查 has_deep_gemm() 而非 has_deep_gemm_bf16_grouped() @ rtp_llm/models_py/modules/factory/fused_moe/impl/cuda/executors/deepgemm_bf16_hybrid_executor.py:91
- 建议：将 executor check_conditions 中的 checker.check(has_deep_gemm()) 改为 checker.check(has_deep_gemm_bf16_grouped())，与 strategy 保持一致。
每次 forward 创建标量 CUDA tensor 带来不必要开销 @ rtp_llm/models_py/modules/factory/fused_moe/impl/cuda/executors/deepgemm_bf16_hybrid_executor.py:134
- 建议：将 torch.tensor(-1, ...) 替换为 Python 整数 -1，torch.where 会自动广播；或在 init 中预创建并复用。

P3

误导性注释：torch.empty 不会零初始化 @ rtp_llm/models_py/modules/factory/fused_moe/impl/cuda/executors/deepgemm_bf16_hybrid_executor.py:334
- 建议：将注释修正为描述实际行为：fwd_kernel_ep_scatter_1 fills m_indices for all aligned slots; clamp guards against any out-of-range values as a safety belt.

Checklist Violations (2 fail / 92 total)

RTP-LLM Checklist

[D] 性能 — hot path per-forward 内存分配 → checklist-only
execute_masked 和 execute_contiguous 中每次 forward 调用均使用 torch.empty 分配多个中间 tensor（input_tensor, upgate_output, down_input, down_output, gather_out 等）。dispose_tensor 及时释放内存但不消除分配开销。对于 opt-in 的非默认路径且当前未兼容 CUDA Graph，per-forward 分配是合理的初始实现，后续性能调优时可预分配。

Python Static-First Checklist

[P.B] 错误处理 — 禁止 bare except 或静默吞异常 → checklist-only
_deepgemm_wrapper.py ensure_bf16_initialized 中 except Exception: pass 静默吞掉所有异常无日志。设计意图是容忍旧 deep_gemm 缺少 bf16 符号，且有注释说明，影响范围仅限 bf16 初始化路径。建议添加 logger.debug 级别日志以增加可追踪性。

Strengths

fp8 和 bf16 符号初始化解耦设计清晰——_ensure_bf16_initialized 独立于 _ensure_initialized，旧版 deep_gemm 缺少 bf16 符号不影响 fp8 路径
hybrid executor 根据 token 数量自动选择 masked/contiguous 布局，decode 小 batch 用低内存 3D layout，prefill 大 batch 用高效 contiguous layout
opt-in 策略设计安全——通过 moe_strategy == 'no_quant_dp_normal_deepgemm' 显式启用，不参与 auto 选路，降低默认路径回归风险
测试覆盖全面：kernel 级 scatter roundtrip + executor 级端到端 EP1/EP2/empty-rank + strategy 级条件检查正反用例 + bf16 初始化容错测试

LLLLKKKK · 2026-06-29T04:00:16Z

AI Code Review - PR #1111

Status: LGTM

Summary: P0/0 · P1/0 · P2/5 · P3/1

lgtm ready to ci

Non-blocking Suggestions

P2

bf16 初始化中 except Exception 捕获范围过宽 @ rtp_llm/models_py/kernels/cuda/deepgemm_wrapper.py:229
- 建议：收窄为 except (RuntimeError, ImportError, AttributeError):，仅捕获 resolve_symbol / import 路径的预期异常类型。保持 never-raise 契约不变，同时让非预期异常暴露。
contiguous 路径 .cpu().tolist() fallback 引入 GPU sync @ rtp_llm/models_py/modules/factory/fused_moe/impl/cuda/executors/deepgemm_bf16_hybrid_executor.py:328
- 建议：在 fallback 路径添加 logging.warning('expert_num_tokens_cpu not provided, triggering GPU sync') 帮助定位意外 sync。低优先级，延续已有模式。
moe_group_args.py 中 no_quant 策略名拼写不一致 @ rtp_llm/server/server_args/moe_group_args.py:169
- 建议：本 PR 的新增拼写正确，无需修改。建议在后续 PR 中统一修正旧条目拼写（需保持旧名称 fallback 兼容），不阻塞本 PR。
contiguous 与 masked 路径 configure_deep_gemm_num_sms 作用域风格不一致 @ rtp_llm/models_py/modules/factory/fused_moe/impl/cuda/executors/deepgemm_bf16_hybrid_executor.py:379
- 建议：将 execute_contiguous 的 Step 2-4 整体包裹在一个 configure_deep_gemm_num_sms 上下文中，与 execute_masked 风格保持一致。silu_and_mul 不受 num_sms 影响，包含在内无副作用。
共享 kernel _fwd_kernel_ep_scatter_1 移除 tl.store mask 降低健壮性 @ rtp_llm/models_py/triton_kernels/moe/ep_kernels.py:39
- 建议：保留 mask=(off_expert + start_m < cur_expert_token_num) 防护；或在 kernel docstring 中明确 per-expert count 必须对齐到 BLOCK_E 的前提条件（当前仅在调用方 ep_scatter L127 有注释），使约束在定义处可见。

P3

诊断测试中使用 print() 输出结果 @ rtp_llm/models_py/triton_kernels/moe/test/test_ep_scatter.py:230
- 建议：改用 logging.info() 记录诊断结果。若仅供本地调试，标记 @unittest.skip('diagnostic only')。

Checklist Violations (3 fail / 92 total)

RTP-LLM Checklist

[D] 性能 — hot path per-forward 内存分配 → checklist-only
executor 的 execute_masked/execute_contiguous 每次 forward 都 torch.empty 分配多个临时 tensor（input_tensor, upgate_output, down_input, down_output, gather_out）。与 FP8 hybrid executor 模式一致，且明确不兼容 CUDA Graph（check_conditions 拒绝 enable_cuda_graph=True）。MoE per-expert token 数每次不同，tensor shape 变化使预分配困难。当前设计合理，延续已有模式。
[D] 性能 — 设备同步 .item()/.cpu()/.tolist() 在关键路径 → issue contiguous 路径 .cpu().tolist() fallback 引入 GPU sync
execute_contiguous 第328行 fallback .cpu().tolist() 在 expert_num_tokens_cpu 为 None 时触发 GPU sync。已有 fast path 优先使用 cpu 副本（第324行），fallback 仅在 cpu 版本不可用时触发。延续 FP8 hybrid executor 相同模式。

Python Static-First Checklist

[P.B] 错误处理 — 禁止 bare except 或静默吞异常 → issue bf16 初始化中 except Exception 捕获范围过宽
_ensure_bf16_initialized() L229 except Exception: pass 和 has_deep_gemm_bf16_grouped() L94 except Exception: return False 捕获范围过宽。设计意图正确（容忍 bf16 符号缺失），但 Exception 同时静默吞掉 TypeError/NameError 等编码 bug。应收窄为 except (RuntimeError, ImportError, AttributeError)。

Strengths

Triton 内核 _fwd_kernel_ep_scatter_1 跨 warp 竞态修复精准：从全局内存 tl.load 改为寄存器级 tl.where+tl.sum 提取 cumsum 值，消除了 256 expert 配置下 store-load 可见性竞态，配合 200 轮 stress test 和 poison-fill 回归测试验证
fp8/bf16 符号初始化完全解耦（_ensure_bf16_initialized 独立于 _ensure_initialized），旧版 deep_gemm 缺少 bf16 grouped 符号时 fp8 路径完全不受影响，向后兼容性处理周到
全面的测试覆盖：executor e2e 覆盖 masked/contiguous 双路径 × 单/多 rank EP × empty rank；scatter 内核覆盖 roundtrip、negative expert id、large random 和 stress；策略覆盖正向与反向用例
策略注册为 opt-in（需显式 --moe_strategy=no_quant_dp_normal_deepgemm），check_conditions 首先检查策略名并 short-circuit，不改变默认 CUDA MoE 路径行为，降低 rollout 风险
deepgemm_wrapper 初始化从模块加载时改为首次使用时惰性加载，消除了 import 时副作用，改善了子进程场景的兼容性

github-actions · 2026-06-29T04:00:28Z

CI dispatcher could not find a native build run for HEAD SHA 92943ce7.

This can happen if the PR was opened before the CI architecture change, or if the original run was deleted.

To fix: push any commit (even empty: git commit --allow-empty -m "trigger CI" && git push) to create a native build run, then re-approve or post lgtm ready to ci.

Add an opt-in expert path for unquantized (bf16) MoE under DeepEP normal mode that uses DeepGEMM grouped GEMM instead of the Triton fused_moe_kernel. - DeepGemmBf16HybridExecutor: runtime-dispatches between a masked 3D layout (small token count / decode) and a contiguous flat layout (large token count / prefill) for better memory utilization. - ep_scatter_bf16 / ep_scatter_v2_bf16: bf16 variants of the existing fp8 scatter kernels (flat -> contiguous, flat -> 3D masked). - CudaNoQuantDpNormalDeepGemmStrategy: opt-in only, selected via --moe_strategy no_quant_dp_normal_deepgemm, gated on bf16 + has_deep_gemm + SM>=9 + no CUDA graph. It is NOT part of "auto" selection, so the default MoE path on existing CUDA deployments is unchanged. deepgemm_wrapper.py changes are backward-compatible and do not affect existing fp8/bf16 callers: - has_deep_gemm() re-checks until the first successful import (then caches True) instead of caching the first result; for normal processes where deep_gemm is importable at import time it returns True on the first call exactly as before. Needed for spawned subprocesses whose sys.path is set up after module import. - Symbol resolution is deferred from import-time to first use (_ensure_initialized); functionally identical, only lazy. - bf16 grouped-GEMM legacy fallback names corrected to the real deep_gemm symbols (gemm_bf16_bf16_bf16_nt*). resolve_symbol() tries the standard name first, so existing resolution is unchanged; this only makes the previously dormant bf16 path resolvable. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…r e2e test_ep_scatter_bf16: the scatter kernels assign output slots with a non-deterministic tl.atomic_add, so row order within an expert is not fixed. Rewrite all checks to be order-independent by following output_index (the authoritative token->slot map the gather stage uses) instead of assuming a token-sequential layout. This replaces the previously order-sensitive torch.equal comparisons that could spuriously fail. Also: - fix the roundtrip tests to use hidden_size % 512 == 0 (ep_gather BLOCK_D=512); - size the contiguous stress test's per-expert capacity from the real routing histogram (bincount, aligned) instead of a fixed count, matching the executor's allocation and avoiding under-allocation. deepgemm_bf16_hybrid_executor: new end-to-end test for the bf16 DeepEP-normal hybrid executor (scatter -> grouped GEMM -> silu_and_mul -> grouped GEMM -> gather with router weight), covering both the masked (small token count) and contiguous (large token count) runtime paths against a plain-torch reference. Tagged open_skip + H20 (requires deep_gemm + SM>=9). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…d bf16 init Review fixes on the bf16 DeepEP-Normal deepgemm MoE path. All changes are confined to the opt-in no_quant_dp_normal_deepgemm path, the bf16 deep_gemm wrappers, and tests — the fp8 path and other default/cross-arch paths are unaffected. - DeepGemmBf16HybridExecutor.execute: handle empty rank before dispatch. DeepEP small-batch / skewed routing can leave a rank with token_num == 0; that would otherwise enter the masked path with alignment == 0 and launch 0-grid Triton scatter / 0-size DeepGEMM. Return an empty same-shape [0, K] bf16 output. - DeepGemmBf16HybridExecutor (contiguous path): build the per-expert token-count tensor with .to(device=hidden_states.device, non_blocking=True) instead of .cuda() so it honors the hidden-states device invariant. - deepgemm_wrapper: decouple bf16 symbol resolution from the fp8 path. Previously _ensure_initialized() resolved fp8 AND bf16 symbols together, so an older deep_gemm build missing the bf16 symbols would raise from _ensure_initialized() and break the fp8 wrappers. Now _ensure_initialized() resolves only _FP8_SYMBOLS (raises if missing — fp8 is core), while _ensure_bf16_initialized() resolves _BF16_SYMBOLS independently and tolerantly (missing -> impls stay None, never propagate). bf16 wrappers call _ensure_bf16_initialized(); fp8 wrappers keep _ensure_initialized(). has_deep_gemm_bf16_grouped() reports False (never raises) when the bf16 symbols are unavailable. - deepgemm_wrapper bf16 grouped wrappers: reject a non-default compiled_dims explicitly (NotImplementedError) instead of silently ignoring it; the wrapper does not forward compiled_dims (forwarding perturbs bf16 numerics on this shared path). No current caller passes a non-"nk" value. - CudaNoQuantDpNormalDeepGemmStrategy: fail fast at selection via has_deep_gemm_bf16_grouped(), and gate on the explicit opt-in moe_strategy FIRST with a short-circuit return (ConditionChecker does not stop at the first failed check, so this keeps the probe from running for non-opt-in / "auto" configs). - Tests: empty-rank (token_num==0) executor cases (ep1 + ep2); ep_size>1 executor coverage (rank 0/1, _to_local_expert_ids mapping + masking) vs a per-rank torch reference; strategy selection pos/neg; has_deep_gemm_bf16_grouped no-raise and _ensure_bf16_initialized tolerance; executor test skip uses has_deep_gemm_bf16_grouped() to match the gating. The ep_kernels contiguous padding-row m_indices contract is left to the feature kernel owner (padding output is discarded by the gather; a real fix needs a kernel signature change + a deep_gemm -1 skip contract). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

LLLLKKKK · 2026-06-29T11:50:20Z

AI Code Review - PR #1111

Status: LGTM

Summary: P0/0 · P1/0 · P2/5 · P3/3

lgtm ready to ci

Non-blocking Suggestions

P2

contiguous 路径 m_indices 注释与 torch.empty 行为不符 @ rtp_llm/models_py/modules/factory/fused_moe/impl/cuda/executors/deepgemm_bf16_hybrid_executor.py:334
- 建议：修正注释为 'fwd_kernel_ep_scatter_1 已对所有 aligned slot 写入 expert index，clamp 为防御性保护'，或改用 torch.zeros 使注释与实现一致。
Executor check_conditions 仅检查 has_deep_gemm() 而非 has_deep_gemm_bf16_grouped() @ rtp_llm/models_py/modules/factory/fused_moe/impl/cuda/executors/deepgemm_bf16_hybrid_executor.py:91
- 建议：将 check_conditions 中的 checker.check(has_deep_gemm()) 替换为 checker.check(has_deep_gemm_bf16_grouped())，与 strategy 侧保持一致的 fail-fast 语义。
contiguous 路径 padding 行参与不必要的 GEMM 计算 @ rtp_llm/models_py/modules/factory/fused_moe/impl/cuda/executors/deepgemm_bf16_hybrid_executor.py:305
- 建议：已知与 FP8 hybrid executor 一致。如后续优化，可传实际 token counts 到 scatter 并仅对实际行构建 m_indices，output buffer 按 aligned 尺寸分配即可。
contiguous 路径 .cpu().tolist() fallback 在 forward hot path 触发 GPU 同步 @ rtp_llm/models_py/modules/factory/fused_moe/impl/cuda/executors/deepgemm_bf16_hybrid_executor.py:300
- 建议：确保 DeepEP Normal router 总是填充 expert_num_tokens_cpu 避免走此 fallback。与 FP8 executor 一致，非本 PR 引入。
moe_group_args.py 中 no_auant 与 no_quant 拼写不统一 @ rtp_llm/server/server_args/moe_group_args.py:169
- 建议：记录此拼写差异为已知问题。如果短期不打算修正旧名称（向后兼容），可以在 choices 列表或文档中加注释说明 'no_auant' 是历史拼写错误。新策略保持正确拼写是合理的。

P3

masked/contiguous 路径 configure_deep_gemm_num_sms 作用域不一致 @ rtp_llm/models_py/modules/factory/fused_moe/impl/cuda/executors/deepgemm_bf16_hybrid_executor.py:342
- 建议：统一两个路径的 configure_deep_gemm_num_sms 作用域：要么都全包裹，要么都只包裹 GEMM 调用。推荐与 masked 路径一致，全包裹 contiguous 路径。
ep_scatter_bf16 contiguous 测试 roundtrip 使用非去重 topk_ids @ rtp_llm/models_py/triton_kernels/moe/test/test_ep_scatter_bf16.py:918
- 建议：可统一用 _distinct_topk，或添加注释说明刻意测试重复路由场景。不影响正确性。
executor_type() 返回 DEEPGEMM_MASKED 但实际支持 masked 和 contiguous 两种模式 @ rtp_llm/models_py/modules/factory/fused_moe/impl/cuda/executors/deepgemm_bf16_hybrid_executor.py:78
- 建议：考虑新增 DEEPGEMM_BF16_HYBRID enum 值，或与 FP8 executor 统一返回 DEEPGEMM_CONTINUOUS。当前注释已说明仅用于 logging，不影响功能。

Checklist Violations (4 fail / 92 total)

General Principles Checklist

[6.1] Architecture — 错误语义：fail-fast/retry/fallback/silent 行为显式 → issue Executor check_conditions 仅检查 has_deep_gemm() 而非 has_deep_gemm_bf16_grouped()
executor check_conditions 使用 has_deep_gemm()（包级别检查），strategy 使用 has_deep_gemm_bf16_grouped()（符号级别检查），fail-fast 语义不一致。executor 可能在包存在但缺少 bf16 grouped 符号时通过检查。

RTP-LLM Checklist

[B] 正确性与逻辑 — 逻辑错误、off-by-one、null/zero 检查 → issue contiguous 路径 m_indices 注释与 torch.empty 行为不符
_executor 第 334 行注释 'from torch.empty initialization' 与实际行为不符——torch.empty 不初始化为零。实际安全性来自 fwd_kernel_ep_scatter_1 已填充所有 aligned slot。注释误导开发者对 padding slot 内存状态的理解。
[D] 性能 — 设备同步 .item()/.cpu()/.tolist() 在关键路径 → issue contiguous 路径 .cpu().tolist() fallback 在 forward hot path 触发 GPU 同步
execute_contiguous line 300: 当 expert_num_tokens_cpu 为 None 时走 .cpu().tolist() fallback 触发 GPU 同步。与 FP8 executor line 382-384 一致，非本 PR 引入但被复制。

Python Static-First Checklist

[P.B] 错误处理 — 禁止 bare except 或静默吞异常 → checklist-only
_ensure_bf16_initialized 中 except Exception: pass 静默吞掉所有异常。设计文档明确说明这是有意行为（bf16 符号可选，解析失败不应传播到 fp8 路径），但不符合 P.B 规范。建议收窄为 except (RuntimeError, AttributeError, ImportError) 或至少添加 debug 级别日志。

Strengths

fp8/bf16 符号解析完全解耦：_ensure_bf16_initialized 独立于 _ensure_initialized，旧版 deep_gemm 缺少 bf16 符号不会影响已有 fp8 路径
测试覆盖全面：scatter kernel 单测（basic/topk/negative_id/roundtrip/large_random）、executor 端到端测试（masked/contiguous 双路径 × ep_size=1/2 × 空 rank）、策略正负用例测试
策略设计为 opt-in（需显式指定 --moe_strategy=no_quant_dp_normal_deepgemm），不改变默认 CUDA MoE 行为，回滚只需移除参数

LLLLKKKK · 2026-07-01T11:19:35Z

AI Code Review - PR #1111

Status: BLOCKING

Summary: P0/0 · P1/0 · P2/1 · P3/2

Non-blocking Suggestions

P2

m_indices.clamp_ 的注释记录了错误的初始化假设 @ rtp_llm/models_py/modules/factory/fused_moe/impl/cuda/executors/deepgemm_bf16_hybrid_executor.py:337
- 建议：修正注释，如实反映 _fwd_kernel_ep_scatter_1 已填充所有 slot 的事实，将 clamp_ 标注为纯防御性措施（而非修正未初始化值）。例如：# Defensive: all m_indices slots are filled by _fwd_kernel_ep_scatter_1, clamp_ guards against any future kernel change that might leave padding slots uninitialized.

P3

_to_local_expert_ids 每次前向调用分配标量张量 @ rtp_llm/models_py/modules/factory/fused_moe/impl/cuda/executors/deepgemm_bf16_hybrid_executor.py:134
- 建议：在 __init__ 中预创建并缓存该标量张量为实例属性 self._neg_one，在 _to_local_expert_ids 中直接引用。
PR 混合了不相关的 OOB 修复与 MoE 功能 @ rtp_llm/models_py/triton_kernels/causal_conv1d/causal_conv1d.py:770
- 建议：后续类似情况建议将独立 bugfix 拆分为单独 PR，便于独立 review、合并和回滚。当前 PR 中 commit 已是原子的，不需要修改。

Checklist Violations (3 fail / 56 total)

General Principles Checklist

[6.1] Quality — 逻辑变更未混入无关格式化 → issue PR 混合了不相关的 OOB 修复与 MoE 功能
commit b968b66 修复了 causal_conv1d 和 fused_recurrent 的 block_map OOB 越界读取，与 bf16 MoE 功能完全无关。虽然各 commit 本身是原子的（可单独 bisect），但将不相关修复放在同一 PR 中增加了 review 和 rollback 的认知负担。
[6.1] Software Engineering — KISS/YAGNI：无投机性抽象 → issue ``_to_local_expert_ids 每次前向调用分配标量张量
`torch.tensor(-1, device=local.device, dtype=local.dtype)` 在每次 `execute()` 调用时创建一个新的标量张量。虽然 `torch.where` 内部可以高效处理标量广播，但在推理热路径上频繁的小张量分配会增加 Python/CUDA 开销。

RTP-LLM Checklist

[I] 代码质量 — 同一功能用统一工具函数 → issue ``m_indices.clamp_ 的注释记录了错误的初始化假设
注释称 "leaves padding slots at 0 (from torch.empty initialization)"，暗示 `torch.empty` 会零初始化。实际上 `torch.empty` 不保证零初始化。更关键的是，`ep_scatter_bf16` 内部调用的 `fwd_kernel_ep_scatter_1` 会为所有 aligned 范围的 slot 填充合法 expert ID（因为传入的 `num_recv_tokens_per_expert_gpu` 是 aligned count，且 BLOCK_E == EXPERT_ALIGNMENT == 128，每个 expert 的 aligned count 是 BLOCK_E 的整数倍）。因此 `m_indices` 在 clamp 执行前已全部是合法值，clamp 本身是 no-op。注释描述了一个不存在的不变量（"unoccupied trailing slots"），可能误导后续维护者对 phase-1 kernel 与 m_indices 之间的实际契约产生错误认知。_

Strengths

fp8/bf16 符号解析完全解耦：bf16 grouped symbol 缺失不会影响 fp8 路径的初始化，_ensure_bf16_initialized 的容错设计确保了策略枚举期间不会因 symbol probe 而中断。
混合 masked/contiguous 执行器设计合理：decode（小 token 数）使用 3D masked 布局避免内存浪费，prefill（大 token 数）使用 flat contiguous 布局避免对齐膨胀，阈值 masked_max_token_num 可通过 CLI 参数调节。
显式 opt-in 策略（--moe_strategy no_quant_dp_normal_deepgemm）配合短路检查，避免在 auto 策略枚举时触发 bf16 symbol probe，设计周到。
测试覆盖全面：executor e2e 测试覆盖了 masked/contiguous 两条路径、单/多 rank EP、空 rank 边界条件；scatter kernel 测试使用 order-independent 验证方式正确处理了 tl.atomic_add 的非确定性顺序。
causal_conv1d 和 fused_recurrent 的 OOB 修复有配套回归测试，明确记录了 CUDA（静默读取邻近内存）和 PPU（硬件异常）的不同表现。

LLLLKKKK · 2026-07-02T08:28:14Z

AI Code Review - PR #1111

Status: BLOCKING

Summary: P0/0 · P1/0 · P2/1 · P3/2

Non-blocking Suggestions

P2

_causal_conv1d_fwd_kernel 的 block_map 写路径缺少相同的 OOB 防护 @ rtp_llm/models_py/triton_kernels/causal_conv1d/causal_conv1d.py:354
- 建议：考虑在 forward 内核中也添加等效的 if dest_idx // SEQ_SIZE_PER_BLOCK < max_block_size: 防护，与 decode 内核保持一致。不阻塞当前 PR。

P3

contiguous 路径 pin_memory=True 对小张量可能引入不必要开销 @ rtp_llm/models_py/modules/factory/fused_moe/impl/cuda/executors/deepgemm_bf16_hybrid_executor.py:316
- 建议：小张量可直接使用 torch.tensor(..., device=device) 同步创建，或保持现状不变（开销极小，功能正确）。
_to_local_expert_ids 每次调用创建 GPU 标量张量 @ rtp_llm/models_py/modules/factory/fused_moe/impl/cuda/executors/deepgemm_bf16_hybrid_executor.py:134
- 建议：可将 -1 标量张量缓存为实例属性（self._neg_one = torch.tensor(-1, device=..., dtype=torch.int32)）在 __init__ 中初始化，或使用 torch.full_like(local, -1) 与 torch.where 配合（PyTorch 会自动广播标量）。

Checklist Violations (2 fail / 56 total)

General Principles Checklist

[6.1] Architecture — 状态不变量：创建/更新/失败/重试/回滚路径有效 → issue ``_causal_conv1d_fwd_kernel 的 block_map 写路径缺少相同的 OOB 防护
_decode 内核（`_causal_conv1d_update_kernel:776`）已添加 `if write_block_offset < stride_block_map:` 防护，但同文件 forward 内核（`causal_conv1d_fwd_kernel:354`）的 `dest_idx // SEQ_SIZE_PER_BLOCK` 仍直接用于 `tl.load(block_map_ptr + ... + dest_idx // SEQ_SIZE_PER_BLOCK)` 而无边界检查。虽然 forward 路径因 block_map 在 prefill 时已为完整上下文预分配而不太可能触发 OOB，但一致性防护可防止未来如果上游 block_map 分配逻辑有 bug 时出现类似的静默损坏。此为 pre-existing 模式，风险较低。

RTP-LLM Checklist

[I] 代码质量 — 同一功能用统一工具函数 → issue ``_to_local_expert_ids 每次调用创建 GPU 标量张量
`torch.tensor(-1, device=local.device, dtype=local.dtype)` 在每次 `execute()` 调用中创建一个新的 GPU 标量张量。该方法在 masked 和 contiguous 两条路径中各调用一次，每次推理请求至少产生一次分配。

Strengths

OOB 修复使用 if write_block_offset < stride_block_map: 分支替代 masked load，避免了某些硬件在 predicate 为 False 时仍评估加载地址的问题，是正确的防御模式。
bf16 MoE 路径严格 opt-in，不纳入 "auto" 策略选择，零回归风险。CudaNoQuantDpNormalDeepGemmStrategy.check_conditions 先检查 moe_strategy 再做符号探测，避免对无关配置产生枚举开销。
_ensure_bf16_initialized 与 _ensure_initialized 解耦设计精巧：bf16 符号解析失败时容忍（impls 留 None），fp8 路径完全不受影响。
DeepGemmBf16HybridExecutor 测试覆盖全面：masked/contiguous 两条路径、ep_size=1 和 ep_size=2（含 rank=1 偏移）、空 rank (token_num=0) 边界、以及 bf16 scatter 内核的顺序无关验证。
空 rank 防护（token_num == 0 → 直接返回 empty tensor）避免了 0-grid Triton scatter / 0-size DeepGEMM 问题。

LLLLKKKK · 2026-07-03T04:38:15Z

AI Code Review - PR #1111

Status: BLOCKING

Summary: P0/0 · P1/0 · P2/2 · P3/5

Non-blocking Suggestions

P2

DeepGemmBf16HybridExecutor 忽略 activation 参数，非 silu 时会静默产出错误结果 @ rtp_llm/models_py/modules/factory/fused_moe/impl/cuda/executors/deepgemm_bf16_hybrid_executor.py:137
- 建议：在 execute()（或每条子路径入口）显式 assert activation in {"silu", "SiGLU"}, ...，与 cutlass 类 executor 的写法对齐；并在 check_conditions 里加一条 checker.check(config.activation_type in {"silu", "SiGLU"})，让策略层直接拒绝不匹配的 activation，避免落到运行期。
DeepGemmBf16HybridExecutor.check_conditions 只检 has_deep_gemm()，未探测 bf16 grouped 符号 @ rtp_llm/models_py/modules/factory/fused_moe/impl/cuda/executors/deepgemm_bf16_hybrid_executor.py:81
- 建议：将 executor 自身的 check_conditions 里的 checker.check(has_deep_gemm()) 换成 checker.check(has_deep_gemm_bf16_grouped())，或直接调 has_deep_gemm_bf16_grouped()，让 executor 独立且完整地表达自己对 deep_gemm 版本的要求，避免依赖某个 strategy 的额外前置检查。

P3

_to_local_expert_ids 每次 forward 都新建一个标量 tensor @ rtp_llm/models_py/modules/factory/fused_moe/impl/cuda/executors/deepgemm_bf16_hybrid_executor.py:128
- 建议：直接写 torch.where((local >= 0) & (local < self.num_experts_per_partition), local, -1)（PyTorch 支持标量 other），或在 __init__ 缓存 self._neg_one = torch.tensor(-1, dtype=torch.int32) 后按需 .to(device)。
execute_contiguous 的 m_indices.clamp_ 注释与实际 kernel 行为不符 @ rtp_llm/models_py/modules/factory/fused_moe/impl/cuda/executors/deepgemm_bf16_hybrid_executor.py:334
- 建议：把注释改为反映实际情况，例如「_fwd_kernel_ep_scatter_1 prefills m_indices for the full aligned span (occupied + padding), so all slots already carry a valid expert id in [0, num_experts_per_partition); the clamp_ is a defensive guard and can be removed once the invariant is verified by test」。若确认无需兜底可直接删掉 clamp_。
_ensure_bf16_initialized 用 bare except Exception 会掩盖非符号缺失的错误 @ rtp_llm/models_py/kernels/cuda/deepgemm_wrapper.py:229
- 建议：收窄到只吞 _lazy_init_deep_gemm 明确会为「符号缺失」抛出的类型（当前是 RuntimeError from raise RuntimeError(...) 的 except AttributeError 分支），并对 ImportError 明确不吞或走一条 log 分支：

except RuntimeError as e:
logger.debug("deep_gemm bf16 grouped symbols unavailable: %s", e)

避免把上游 import 崩溃也当成「bf16 不可用」静默处理。

has_deep_gemm() 未成功前不缓存导致每次调用都重跑 find_spec @ rtp_llm/models_py/kernels/cuda/deepgemm_wrapper.py:71
- 建议：引入负缓存 + TTL/次数上限，或改用「只在 first N 次调用内重探测」的策略；也可以让子进程晚初始化路径显式调用一个 refresh_deep_gemm() 之类的 API 主动清缓存，而不是牺牲常态下的缓存效率。
bf16 ep_scatter_bf16 / ep_scatter_v2_bf16 与已有 ep_scatter* 大量重复 @ rtp_llm/models_py/triton_kernels/moe/ep_kernels.py:186
- 建议：用 USE_SCALE: tl.constexpr 开关合并 _fwd_kernel_ep_scatter_2 与 _fwd_kernel_ep_scatter_2_bf16（同理 v2 版本），把 scale 相关 load/store 放在 if USE_SCALE: 分支下。Triton constexpr 会在编译期消除分支，性能等价。Python wrapper 也可以合并成一个 ep_scatter 参数化 recv_x_scale=None。

Checklist Violations (6 fail / 56 total)

General Principles Checklist

[6.1] Architecture — 可观测性：日志/指标/超时可操作、非噪声 → issue has_deep_gemm()` 未成功前不缓存导致每次调用都重跑 `find_spec
新实现移除了 @functools.cache，只在 available == True 时缓存。设计意图是子进程 late sys.path 场景下允许「一次失败后重试」，但副作用是：在 deep_gemm 从未安装的环境（例如 CPU-only 单元测试、非 CUDA CI lane），每次 has_deep_gemm() 都会再次调用 importlib.util.find_spec("deep_gemm")——find_spec 会走一遍 meta_path finders 与 sys.path，是非平凡开销。而策略枚举里每个 strategy 的 check_conditions 都可能间接调用它（executor.check_conditions 在本 PR 里就直接调），累计次数可观。
[6.1] Architecture — 错误语义：fail-fast/retry/fallback/silent 行为显式 → issue ``_ensure_bf16_initialized用 bareexcept Exception` 会掩盖非符号缺失的错误`
_`try: _lazy_init_deep_gemm(_BF16_SYMBOLS) except Exception: pass` 意图是吞掉「老 deep_gemm 缺 bf16 grouped 符号」的 `RuntimeError`，但会同时吞掉 `ImportError`（deep_gemm 本身 broken）、`ValueError`（`_lazy_init_deep_gemm` 对 unknown symbols 会 raise）等无关错误。真出问题时排查会很困难，只能靠观察 `m_grouped_bf16_gemm_nt_masked_impl is None` 反推。
[6.1] Software Engineering — DRY：重复非平凡逻辑被抽取或显式复用 → issue bf16 ep_scatter_bf16/ep_scatter_v2_bf16与已有ep_scatter* 大量重复
_fwd_kernel_ep_scatter_2_bf16 与 _fwd_kernel_ep_scatter_2 的差异只是移除 scale 分支；_fwd_kernel_ep_scatter_2_v2_bf16 与 _fwd_kernel_ep_scatter_2_v2 同理。四份 kernel + 两个 python wrapper 都是"复制 + 删 scale"，后续 kernel 逻辑（cursor 排布/grid tuning/OOB 保护）任一修改都需要同步维护多份，容易漂移。已有 _fwd_kernel_ep_scatter_1_v2 就是本 PR bf16 v2 路径复用的，说明抽象是可行的。

RTP-LLM Checklist

[I] 代码质量 — 同一功能用统一工具函数 → issue ``DeepGemmBf16HybridExecutor忽略activation` 参数，非 silu 时会静默产出错误结果`
`execute()` 接收 `activation: str` 但从不消费——`execute_masked` 硬编码 `silu_mul_masked_bf16_no_post_quant_fwd`，`execute_contiguous` 硬编码 `silu_and_mul`。仓库其他 executor（`cutlass_moe.py:149`, `cutlass_moe.py:377`, `cutlass_w4a8_moe.py:160/394`）都用 `assert activation == "SiGLU"` 做显式契约校验；本 executor 完全不校验。策略层 `check_conditions` 也未限制 `config.activation_type`。当调用方传入 `"gelu"` / `"SiGLU"` 之类合法值（模型 config 完全可能提供），executor 会静默改用 silu 计算，产生错的 fused_expert_output——问题不会在测试或 CI 中被自动捕获，因为测试固定传 `"silu"`。虽然该 strategy 是 opt-in

Python Static-First Checklist

[P.B] 错误处理 — 禁止 bare except 或静默吞异常 → issue ``_ensure_bf16_initialized用 bareexcept Exception` 会掩盖非符号缺失的错误`
_`try: _lazy_init_deep_gemm(_BF16_SYMBOLS) except Exception: pass` 意图是吞掉「老 deep_gemm 缺 bf16 grouped 符号」的 `RuntimeError`，但会同时吞掉 `ImportError`（deep_gemm 本身 broken）、`ValueError`（`_lazy_init_deep_gemm` 对 unknown symbols 会 raise）等无关错误。真出问题时排查会很困难，只能靠观察 `m_grouped_bf16_gemm_nt_masked_impl is None` 反推。
[P.D] 性能反模式 — lru_cache 注意内存泄漏风险 → issue has_deep_gemm()` 未成功前不缓存导致每次调用都重跑 `find_spec
新实现移除了 @functools.cache，只在 available == True 时缓存。设计意图是子进程 late sys.path 场景下允许「一次失败后重试」，但副作用是：在 deep_gemm 从未安装的环境（例如 CPU-only 单元测试、非 CUDA CI lane），每次 has_deep_gemm() 都会再次调用 importlib.util.find_spec("deep_gemm")——find_spec 会走一遍 meta_path finders 与 sys.path，是非平凡开销。而策略枚举里每个 strategy 的 check_conditions 都可能间接调用它（executor.check_conditions 在本 PR 里就直接调），累计次数可观。

Strengths

KV-cache 边界 OOB 修复（causal_conv1d.py 与 fused_recurrent.py）分离读/写路径两处 offset 分别加保护，附带专门的 boundary 回归测试（seq==MAX+1、==MAX+2、==MAX-1），并解释了「部分硬件即使 masked 也会 evaluate load address」的语义原因，属于扎实的 root-cause fix。
has_deep_gemm_bf16_grouped() 在 strategy 层做真实符号探测，将「老 deep_gemm 缺 bf16 grouped 符号」从首次 forward 的 late failure 提前到策略枚举时的 fail-fast，并显式说明「decoupled from fp8 path」的意图；_ensure_bf16_initialized 用独立 flag 与 tolerant except 避免污染 fp8 路径。
CudaNoQuantDpNormalDeepGemmStrategy.check_conditions 首行 moe_strategy == "no_quant_dp_normal_deepgemm" 前置守门 + return 短路，注释解释了 ConditionChecker 不会 stop-at-first-fail 的行为，避免在 "auto" 等无关配置枚举时反复调用 has_deep_gemm_bf16_grouped 探测。
executor 对「rank tokens==0」提前返回，避免 masked 路径 alignment==0 触发 0-grid Triton launch / 0-size DeepGEMM；测试覆盖了 ep_size=1 与 ep_size=2, rank=1 两种空 rank 情形。
m_grouped_bf16_gemm_nt_contiguous/masked 对不支持的 compiled_dims 显式 NotImplementedError，而不是静默接受并 forward 一个不再被 kernel 使用的参数——比之前"接受但忽略"的语义更清晰。
测试文件对 EP 采取 simulate-EP 模式，注释说明真实 DeepEP 全 all-to-all 需要 NVSHMEM 多进程、不可单进程复现，并复用 deepep_normal_executor_test.py 的既有模式；correctness check 用 output_index 做 order-independent 校验，回避 atomic_add 分配顺序非确定性——是同类测试里比较扎实的写法。

…nearAttn decode kernels Root cause: the GatedDeltaNet decode kernels silently corrupt data or crash when a multi-turn conversation accumulates enough tokens that the sequence fills all allocated KV-cache blocks. Two independent block_map offsets can go out of bounds: read path: read_block_offset = (sequence_length - 1) // SEQ_SIZE_PER_BLOCK write path: write_block_offset = sequence_length // SEQ_SIZE_PER_BLOCK When sequence_length reaches block_map.size(1) * SEQ_SIZE_PER_BLOCK (+1 for the write offset, +2 for the read offset), the offset reaches block_map.size(1) — one past the end of the block_map row. The OOB block_map read yields a garbage block id; every downstream state load/store then computes an out-of-bounds address, corrupting the KV cache or faulting depending on the platform (some hardware evaluates load addresses even for masked-off lanes, so a masked load does not protect against this). Affected kernels: _causal_conv1d_update_kernel (causal_conv1d.py) fused_recurrent_gated_delta_rule_fwd_kernel (fused_recurrent.py) Fix: causal_conv1d.py: read path — clamp read_block_offset to the last allocated block (stride_block_map - 1) before the block_map load. write path — replace the masked tl.load with an explicit `if write_block_offset < stride_block_map:` branch that fully skips address evaluation on OOB. fused_recurrent.py: write path — wrap the write-state block in `if write_block_offset < max_block_size:` (the read path already guards via `if read_block_id <= 0: return`). Test: TestCausalConv1dMaxSeqLenBoundary covers both boundaries — - write offset OOB at sequence_length == MAX_SEQ_LEN + 1 - read offset OOB at sequence_length == MAX_SEQ_LEN + 2 (garbage block id fed into the conv_state load) Both verify no crash and finite output; the last in-bounds step stays correct. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…aph capture 32k/64k seq_len with DP=8 and 8 capture batch sizes requires 40-50 min for CUDA Graph capture on PPU. The previous 1600s default (and 3000s intermediate fix) was insufficient.

LLLLKKKK · 2026-07-03T12:14:04Z

AI Code Review - PR #1111

Status: BLOCKING

Summary: P0/0 · P1/0 · P2/3 · P3/3

Non-blocking Suggestions

P2

--moe_strategy 新选项拼写与既有选项风格不一致 @ rtp_llm/server/server_args/moe_group_args.py:168
- 建议：要么把新选项名对齐既有 typo 风格（不建议，进一步固化历史错误），要么在 moe_group_args.py 中把三个既有 typo 选项各追加一个正确拼写的别名（并在对应 strategy 的 check_conditions 中同时接受两种拼写），同时补一个 deprecation 注释说明 typo 版将来会移除；或者至少在 --moe_strategy 的 help 文本里明确指出「历史选项使用 no_auant_ 前缀，新增选项使用 no_quant_ 前缀」，避免用户踩坑并方便 grep。
kernel 中用 stride_block_map 作为 size(1) 上界，隐含 block_map 必须行连续 @ rtp_llm/models_py/triton_kernels/causal_conv1d/causal_conv1d.py:706
- 建议：方式一（推荐，风险最低）：在 kernel 的 Python wrapper（如 causal_conv1d_update 与 fused_recurrent_gated_delta_rule）入口断言 block_map.is_contiguous() 或至少 block_map.stride(1) == 1 and block_map.stride(0) == block_map.size(1)，并明确 doc 该约束；同时把 clamp 里的 stride_block_map - 1 改为一个专门传入的 num_block_slots（= block_map.size(1)）参数，语义与实现解耦，未来若 layout 演化不再是隐式陷阱。方式二（最小改动）：至少在 clamp 附近的注释里明确「本 clamp 依赖 block_map 行连续」，并加一条 unit test 用非连续 block_map（如 torch.arange(...).view(2, 2*num_blocks)[:, :num_blocks]）验证行为符合预期或明确 raise。
fused_recurrent 越界修复缺回归测试 @ rtp_llm/models_py/triton_kernels/fla/fused_recurrent.py:119
- 建议：参照 test_casual_conv1d_decode.py::TestCausalConv1dMaxSeqLenBoundary 的写法，在 fla 对应 test 目录下新增至少 2 条对称回归：一条构造 sequence_length == max_block_size * SEQ_SIZE_PER_BLOCK + 1（触发 write_block_offset 越界），一条构造 sequence_length == max_block_size * SEQ_SIZE_PER_BLOCK + 2（触发 load_block_offset 越界），并类似地在 block_map 之后布置一个 huge_block_id 让越界读的坏地址可复现，torch.isfinite(out).all() 断言 + 不 crash 即可判定回归。

P3

DeepGemmBf16HybridExecutor.check_conditions 探测强度弱于同路径 strategy @ rtp_llm/models_py/modules/factory/fused_moe/impl/cuda/executors/deepgemm_bf16_hybrid_executor.py:82
- 建议：把 executor 的 check_conditions 中的 has_deep_gemm() 直接替换为 has_deep_gemm_bf16_grouped()，语义与 strategy 对齐并集中真值来源；或抽出一个共享的 helper（如 _deepgemm_bf16_ready(config)）供 strategy 与 executor 共同复用，避免同一契约在两处漂移。
executor_type() 固定返回 DEEPGEMM_MASKED 会污染监控/日志语义 @ rtp_llm/models_py/modules/factory/fused_moe/impl/cuda/executors/deepgemm_bf16_hybrid_executor.py:60
- 建议：方式一：在 ExecutorType 中新增一个 DEEPGEMM_BF16_HYBRID（或 DEEPGEMM_HYBRID）枚举值并返回；如果只是为了避免加枚举，至少在 execute_contiguous 里显式打一条 logger.info / metric event 明确「contiguous path selected, token_num=…」，让线上可观测到真实分派。方式二：把 executor 的运行时分派拆成两个独立的 Executor 类（DeepGemmBf16MaskedExecutor / DeepGemmBf16ContiguousExecutor），由 strategy 或上层根据 batch 形状选择，彻底消除 nominal type 与实际行为的偏差。
perf_test/server.py 与 MoE feature 无关的超时调整不属于本 PR 范围 @ rtp_llm/test/perf_test/server.py:38
- 建议：拆到独立 commit 或独立 PR，附上「为什么把 timeout 从 X 改到 5000」的说明（单位、影响的模型/场景），本 PR 内至少加一个行内注释交代动机；若确实是 bf16 MoE 大模型启动更慢的观测所迫，把这个背景写进注释即可保留。

Checklist Violations (10 fail / 56 total)

General Principles Checklist

[6.1] Architecture — 兼容性：公开 API/持久数据/配置/环境迁移安全 → issue ``--moe_strategy 新选项拼写与既有选项风格不一致
`choices` 列表内既有的三个无量化选项都写作 `no_auant_ep_low_latency` / `no_auant_cpp` / `no_auant_dp_normal`（明显 typo，缺少 `q`），本 PR 在其后紧邻插入了拼写正确的 `no_quant_dp_normal_deepgemm`。CLI 由 argparse 的 `choices` 严格匹配，用户看到列表后合理外推「no_quant_ep_low_latency」会得到 invalid choice；同时新策略的 `check_conditions` 内以字面量 `"no_quant_dp_normal_deepgemm"` 做匹配，未提供别名，无法兼容 typo 风格。
[6.1] Architecture — 可观测性：日志/指标/超时可操作、非噪声 → issue ``executor_type()固定返回DEEPGEMM_MASKED` 会污染监控/日志语义`
类注释显式说明 `executor_type()` 只用于 logging/registration、实际 masked/contiguous 分派在 `execute()` 内根据 token 数动态决定，但仍固定返回 `ExecutorType.DEEPGEMM_MASKED`。这意味着 prefill 时（走 contiguous 分支）任何依赖该枚举做统计、路由、告警的观测面（metrics tag、trace span 名、CI 分析等）都会看到 "DEEPGEMM_MASKED"，与实际执行路径不一致，后续排查 prefill 性能/正确性问题时容易被误导；同时 `strategy.priority` 在部分代码里由 `router_type.value * 10 + executor_type.value` 计算（见 `test_cuda_strategies.py` 中的既有断言），contiguous 场景优先级也被隐式挂在 masked 上。
[6.1] Architecture — 状态不变量：创建/更新/失败/重试/回滚路径有效 → issue kernel 中用 stride_block_map作为size(1) 上界，隐含 block_map 必须行连续
新增 clamp 表达式为 read_block_offset = tl.minimum(read_block_offset, stride_block_map - 1) 与 write_ok = write_block_offset < stride_block_map；两处用的都是 stride_block_map（block_map.stride(0)）而不是 block_map.size(1)。仅当 block_map 是 [batch, num_blocks] 且行连续（stride(0) == num_blocks）时二者才相等；若上层传入的是 non-contiguous 的切片/view（例如 block_map[:, :N]、来自 torch.as_strided 的视图，或未来做「pad + 复用」优化），stride(0) 会大于实际列数，clamp 反而放行了真正越界的列索引，越界读进入相邻行数据，KV cache 会静默污染但 crash 不复现。fla/fused_recurrent.py 里用的是 `max_bl
[6.1] Architecture — 错误语义：fail-fast/retry/fallback/silent 行为显式 → issue ``executor_type()固定返回DEEPGEMM_MASKED` 会污染监控/日志语义`
类注释显式说明 `executor_type()` 只用于 logging/registration、实际 masked/contiguous 分派在 `execute()` 内根据 token 数动态决定，但仍固定返回 `ExecutorType.DEEPGEMM_MASKED`。这意味着 prefill 时（走 contiguous 分支）任何依赖该枚举做统计、路由、告警的观测面（metrics tag、trace span 名、CI 分析等）都会看到 "DEEPGEMM_MASKED"，与实际执行路径不一致，后续排查 prefill 性能/正确性问题时容易被误导；同时 `strategy.priority` 在部分代码里由 `router_type.value * 10 + executor_type.value` 计算（见 `test_cuda_strategies.py` 中的既有断言），contiguous 场景优先级也被隐式挂在 masked 上。
[6.1] Quality — Commit 原子、message 与行为匹配 → issue ``perf_test/server.py 与 MoE feature 无关的超时调整不属于本 PR 范围
_Diff 中把 `self._server.start_server()` 改为 `self.server.start_server(timeout=5000)`。该改动与 PR 主题「bf16 deepgemm MoE」无逻辑关联，PR 描述与提交信息未看到独立说明，5000 单位（秒/毫秒？）与选择原因也没有注释；一旦 `MagaServerManager.start_server` 的 timeout 语义演化，这一挂靠改动会缺少上下文。
[6.1] Quality — Mega-PR 已拆分为独立变更 → issue ``perf_test/server.py 与 MoE feature 无关的超时调整不属于本 PR 范围
_Diff 中把 `self._server.start_server()` 改为 `self.server.start_server(timeout=5000)`。该改动与 PR 主题「bf16 deepgemm MoE」无逻辑关联，PR 描述与提交信息未看到独立说明，5000 单位（秒/毫秒？）与选择原因也没有注释；一旦 `MagaServerManager.start_server` 的 timeout 语义演化，这一挂靠改动会缺少上下文。
[6.1] Quality — 逻辑变更未混入无关格式化 → issue ``perf_test/server.py 与 MoE feature 无关的超时调整不属于本 PR 范围
_Diff 中把 `self._server.start_server()` 改为 `self.server.start_server(timeout=5000)`。该改动与 PR 主题「bf16 deepgemm MoE」无逻辑关联，PR 描述与提交信息未看到独立说明，5000 单位（秒/毫秒？）与选择原因也没有注释；一旦 `MagaServerManager.start_server` 的 timeout 语义演化，这一挂靠改动会缺少上下文。
[6.1] Software Engineering — DRY：重复非平凡逻辑被抽取或显式复用 → issue ``DeepGemmBf16HybridExecutor.check_conditions 探测强度弱于同路径 strategy
`CudaNoQuantDpNormalDeepGemmStrategy.check_conditions` 使用 `has_deep_gemm_bf16_grouped()` 探测真实的 bf16 grouped 符号，而 executor 的 `check_conditions` 只调用了 `has_deep_gemm()`。虽然当前调用序保证 strategy 通过后 executor 一定通过，两处逻辑不一致仍留下两处「同一契约的不同表达」，未来若有直接以 `executor.check_conditions` 做能力探测的调用（例如新增的 auto-selector 或诊断脚本），会得到「符号可能缺失但 executor 声称支持」的错觉。
[6.1] Tests — 新逻辑有聚焦单测 + 相关集成/smoke 测试 → issue ``fused_recurrent 越界修复缺回归测试
`fused_recurrent_gated_delta_rule_fwd_kernel` 里 `load_block_offset` 与 `write_block_offset` 越界的修复思路与 `causal_conv1d` 完全对称（`tl.minimum` clamp 到最后一个 slot + `write_ok` mask），但本 PR 只为 `causal_conv1d` 补了 `test_no_crash_at_max_seq_len_boundary` / `test_no_crash_when_read_offset_out_of_bounds` / `test_output_correct_one_before_boundary` 三条边界回归；`fla/fused_recurrent.py` 侧没有任何新增测试文件与用例，也没有在既有 `fla` 测试中添加对应场景。日后若 clamp 逻辑因重构（例如把 `max_block_size` 参数改名或拆分）而回退，回归无法在 CI 里被捕获。
[6.1] Tests — 边界 case 覆盖（空、单元素、最大值） → issue ``fused_recurrent 越界修复缺回归测试
`fused_recurrent_gated_delta_rule_fwd_kernel` 里 `load_block_offset` 与 `write_block_offset` 越界的修复思路与 `causal_conv1d` 完全对称（`tl.minimum` clamp 到最后一个 slot + `write_ok` mask），但本 PR 只为 `causal_conv1d` 补了 `test_no_crash_at_max_seq_len_boundary` / `test_no_crash_when_read_offset_out_of_bounds` / `test_output_correct_one_before_boundary` 三条边界回归；`fla/fused_recurrent.py` 侧没有任何新增测试文件与用例，也没有在既有 `fla` 测试中添加对应场景。日后若 clamp 逻辑因重构（例如把 `max_block_size` 参数改名或拆分）而回退，回归无法在 CI 里被捕获。

Strengths

测试覆盖非常完整：deepgemm_bf16_hybrid_executor_test.py 同时覆盖 masked/contiguous 两条路径、ep_size=1、ep_size=2 且 rank=0/1、以及空 rank 场景；test_ep_scatter_bf16.py 用 output_index 做「非顺序敏感」校验并同时覆盖 basic/topk/负 expert id/roundtrip/large_random；test_cuda_strategies.py 增加了正向 + 4 种负向 disqualify 用例，并新增了 has_deep_gemm_bf16_grouped 与 _ensure_bf16_initialized 的独立行为回归。
DeepGEMM wrapper 拆分为 fp8/bf16 两条独立初始化路径的重构方向正确：fp8 缺失符号保持 hard error（对 fp8 用户是 broken build），bf16 缺失符号仅使新策略不可选，不再牵连 fp8 用户；has_deep_gemm_bf16_grouped() 显式在 strategy 选择时探测真实符号，避免了「选中之后 first forward 才 raise」的坏体验。
新策略明确 opt-in（checker.check(config.moe_strategy == "no_quant_dp_normal_deepgemm") + 短路 return），不进 auto 选择，控制了对现有 CUDA MoE 默认路径的影响面。
causal_conv1d 与 fused_recurrent 的越界修复思路清晰：tl.minimum 把 offset clamp 到最后一个 slot（保证 load 安全），额外用 write_ok mask 屏蔽真正越界的写，避免了单纯 if 分支在 PPU 上 CUDA Graph capture 时的 phi/寄存器问题；causal_conv1d 侧还给出了 test_no_crash_at_max_seq_len_boundary 与 test_no_crash_when_read_offset_out_of_bounds 两条明确定位读/写两条 OOB 路径的回归测试，并特意布置 huge_block_id 让越界读的坏地址可复现。
Executor 里对 empty rank（token_num==0）做了显式短路返回，避免下游 masked 路径以 alignment==0 触发 0-grid Triton scatter / 0-size DeepGEMM。
bf16 wrapper 显式 reject 非 "nk" 的 compiled_dims（NotImplementedError + 明确消息），避免历史上「forwarding compiled_dims 会扰动 bf16 numerics」的隐坑被静默触发。

Tanmo-ai requested a review from LLLLKKKK as a code owner June 17, 2026 01:59

Tanmo-ai force-pushed the feature/moe-bf16-deepep-deepgemm branch 2 times, most recently from e4564ea to 68b0a44 Compare June 17, 2026 08:54

Tanmo-ai force-pushed the feature/moe-bf16-deepep-deepgemm branch from 2012875 to 8e0e71d Compare June 17, 2026 12:01

Tanmo-ai force-pushed the feature/moe-bf16-deepep-deepgemm branch from 8e0e71d to 0a39628 Compare June 17, 2026 14:25

Tanmo-ai force-pushed the feature/moe-bf16-deepep-deepgemm branch from 0a39628 to e284541 Compare June 18, 2026 03:05

Tanmo-ai force-pushed the feature/moe-bf16-deepep-deepgemm branch from e284541 to cc2025a Compare June 18, 2026 04:01

Tanmo-ai force-pushed the feature/moe-bf16-deepep-deepgemm branch from cc2025a to 9ecf9d8 Compare June 18, 2026 05:28

Tanmo-ai force-pushed the feature/moe-bf16-deepep-deepgemm branch from 9ecf9d8 to 2658cff Compare June 18, 2026 05:58

Tanmo-ai force-pushed the feature/moe-bf16-deepep-deepgemm branch from 2658cff to 81d0d35 Compare June 28, 2026 00:54

Tanmo-ai force-pushed the feature/moe-bf16-deepep-deepgemm branch 2 times, most recently from c9bfddc to 92943ce Compare June 29, 2026 02:51

Tanmo-ai and others added 3 commits June 29, 2026 17:59

Tanmo-ai force-pushed the feature/moe-bf16-deepep-deepgemm branch from 92943ce to 2594c21 Compare June 29, 2026 10:10

Tanmo-ai force-pushed the feature/moe-bf16-deepep-deepgemm branch from 03294ae to 363f9c3 Compare July 2, 2026 08:06

Tanmo-ai force-pushed the feature/moe-bf16-deepep-deepgemm branch 3 times, most recently from 12fdee5 to cfe0fab Compare July 2, 2026 16:29

Tanmo-ai force-pushed the feature/moe-bf16-deepep-deepgemm branch from cfe0fab to 4bd6f1f Compare July 3, 2026 07:31

fix(perf): increase server startup timeout to 5000s for large CUDA Gr…

8582fac

…aph capture 32k/64k seq_len with DP=8 and 8 capture batch sizes requires 40-50 min for CUDA Graph capture on PPU. The previous 1600s default (and 3000s intermediate fix) was insufficient.

Uh oh!

Conversation

Tanmo-ai commented Jun 17, 2026

Summary

Changes

deepgemm_wrapper.py changes (backward-compatible)

Testing

Uh oh!

CLAassistant commented Jun 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

LLLLKKKK commented Jun 17, 2026

AI Code Review - PR #1111

Blocking Issues

P1

Non-blocking Suggestions

P2

Checklist Violations (5 fail / 104 total)

Strengths

Uh oh!

LLLLKKKK commented Jun 17, 2026

AI Code Review - PR #1111

Non-blocking Suggestions

P2

Checklist Violations (7 fail / 104 total)

Strengths

Uh oh!

LLLLKKKK commented Jun 17, 2026

AI Code Review - PR #1111

Checklist ✅ (104 items passed)

Strengths

Uh oh!

LLLLKKKK commented Jun 17, 2026

AI Code Review - PR #1111

Non-blocking Suggestions

P2

Checklist Violations (2 fail / 56 total)

Strengths

Uh oh!

LLLLKKKK commented Jun 17, 2026

AI Code Review - PR #1111

Non-blocking Suggestions

P2

Checklist Violations (5 fail / 104 total)

Strengths

Uh oh!

LLLLKKKK commented Jun 18, 2026

AI Code Review - PR #1111

Blocking Issues

P1

Non-blocking Suggestions

P2

Checklist Violations (4 fail / 60 total)

Strengths

Uh oh!

LLLLKKKK commented Jun 18, 2026

AI Code Review - PR #1111

Blocking Issues

P1

Non-blocking Suggestions

P2

Checklist Violations (3 fail / 104 total)

Strengths

Uh oh!

LLLLKKKK commented Jun 18, 2026

AI Code Review - PR #1111

Blocking Issues

P1

Checklist Violations (5 fail / 104 total)

Strengths

Uh oh!

LLLLKKKK commented Jun 18, 2026

AI Code Review - PR #1111

Checklist ✅ (56 items passed)

Strengths

Uh oh!

wht21 commented Jun 18, 2026

Uh oh!

LLLLKKKK commented Jun 28, 2026

AI Code Review - PR #1111

CLAassistant commented Jun 17, 2026 •

edited

Loading